Acronyms

Application specific integrated circuit (ASIC)
Block random access memory (BRAM)
Block Triple Modular Redundancy (BTMR)
Clock (CLK or CLKB)
Combinatorial logic (CL)
Configurable Logic Block (CLB)
Digital Signal Processing Block (DSP)
Distributed triple modular redundancy (DTMR)
Edge-triggered flip-flops (DFFs)
Equivalence Checking (EC)
Error detection and correction (EDAC)
Field programmable gate array (FPGA)
Gate Level Netlist (EDF, EDIF, GLN)
Global triple modular redundancy (GTMR)
Hardware Description Language (HDL)
Input/output (I/O)
Linear energy transfer (LET)
Local triple modular redundancy (LTMR)
Look-up table (LUT)
Operational frequency (fs)
Power-on reset (POR)
Place and Route (PR)
Radiation Effects and Analysis Group (REAG)
Single event functional interrupt (SEFI)
Single event effects (SEEs)
Single event latch-up (SEL)
Single event transient (SET)
Single event upset (SEU)
Single event upset cross-section (σSEU)
Static random access memory (SRAM)
System on a chip (SOC)

Agenda

Field Programmable Gate Array (FPGA) versus Application Specific Integrated Circuit (ASIC) Devices.
What's Inside an FPGA?
FPGAs and Critical Applications.
Single Event Upsets in FPGA Configuration.
Single Event Upsets in an FPGA's Functional Data Path and Fail-Safe Strategies.
Fail-Safe Strategies for FPGA Critical Applications.

Definitions

• A Field-Programmable Gate Array (FPGA) is a semiconductor device containing configurable logic components called "logic blocks" and configurable interconnects. Logic blocks can be configured to perform the function of basic logic gates, such as AND and XOR, or more complex combinational functions, such as decoders or mathematical functions.

• An application-specific integrated circuit (ASIC) is an integrated circuit designed for a particular use rather than for general-purpose use. Processors, RAM, ROM, etc., are examples of ASICs.

• An FPGA is itself made out of an ASIC.

Creating a Design in an Integrated Circuit Device (FPGA or ASIC)

• The idea is to describe a hardware design using a hardware description language (HDL):
  - Clocks,
  - Sequential logic, and
  - Combinatorial logic.
• The description gets synthesized into a hardware gate-level netlist (GLN: a file listing gates and connectivity).
• The synthesized hardware gates are mapped and placed into the cell library (or logic blocks) of the target FPGA or ASIC.

[Figure: HDL description of combinatorial and sequential blocks, implemented through configuration (FPGA) or a photo-mask (ASIC).]

Design Tools

• Design tools are used for each step of the design process:
  - Synthesis: maps the HDL into logic blocks (cells) and outputs gate-level netlists.
  - Place and route (PR): optimizes where the logic blocks and their interconnects should be placed.
• Synthesis and place-and-route tools contain optimization algorithms within their tool sets:
  - These algorithms are used to optimize area, power, and logic function.
  - The tools are complex and can produce incorrect functional logic.
  - Equivalence checking (EC) verifies that the tool output matches the HDL.
  - Poorly designed tools can create designs that are too large to fit into the target device or that consume too much power, and are therefore unusable.

Best practice is to use a proven vendor's tool set; otherwise, the product might be unreliable or unusable.

ASIC Design Flow

HDL: Hardware description language
STA: Static timing analysis
EC: Equivalence checking

[Figure: ASIC design flow: functional specification, HDL, synthesis, then floorplanning, clock balancing, place and route, and timing closure.]

FPGA Design Flow

FPGAs are created by manufacturers and sold to users. The user maps a design into the FPGA fabric.

[Figure: user design flow. FPGAs are sold to users with configurable logic blocks and routes (they do not contain an operable design); the user maps a design into the FPGA circuits.]

FPGA User Design Flow

HDL: Hardware description language
STA: Static timing analysis
EC: Equivalence checking

[Figure: FPGA user design flow, from functional specification through synthesis. It looks like the ASIC design flow, but without the …, STA, and back-annotated gate-level simulation.]

The user creates a design that is mapped into a manufacturer-provided FPGA.

FPGA or ASIC?

FPGA and ASIC Devices: System Usage

• An FPGA (similarly to an ASIC) can be used to solve any problem that is computable:
  - The user implements a digital (or mixed-signal) design.
  - The design can be trivial glue logic (e.g., interface control), or
  - The design can be as complex as a system on a chip that may include processors, embedded memory, and high-speed serial interfaces (Gigabit SERDES; SERDES: serializer/deserializer).
• The number of gates contained within the original FPGA devices was too small to compete with the ASIC devices of that time (1980s):
  - FPGAs were mostly used as interface glue logic.
  - They reduced system cost and added flexibility.
• Modern-day FPGAs contain millions of gates and have taken over a significant amount of the ASIC market.

The ASIC Advantage

ASIC Advantage | Comment/Explanation
Full custom capability | The design is "tailored" and is manufactured to design specifications (no additional hidden logic)
Lower unit costs | Great for very high volume projects
Smaller form factor | Less logic is required because the device is manufactured to design specs
No configuration | Overall reliability can decrease due to the addition of configuration technology/logic
Lower power | Less logic is required because the device is manufactured to design specs

The FPGA Advantage

FPGA Advantage | Comment/Explanation
Faster time-to-market | No layout, masks, or other manufacturing steps are needed
No upfront non-recurring expenses (NRE) | Avoids the costs typically associated with an ASIC design
Simpler design cycle | The required tools handle routing, placement, and timing
More predictable project cycle | Eliminates potential re-spins and the concern about wafer capacities that exists for ASICs
Field reprogrammability | It is easier to change a design in a system
Engineer availability | More students are taught FPGA design in school

FPGA: faster design cycle and cheaper to implement.

What Is Inside FPGA Devices?

General FPGA Architecture: Fabric Containing Customizable Preexisting Logic ... User Building Blocks

How Do FPGAs Differ?

• Manufacturer architecture (not all aspects are listed):
  - Configuration,
  - User building blocks (combinatorial logic cells, sequential logic cells),
  - Routing,
  - Clock structures,
  - Embedded mitigation, and
  - Embedded intellectual property (IP), e.g., memories and processors.
• Manufacturer design tool environment:
  - Synthesis,
  - Place and route, and
  - Configuration management output.

Differences in architectures and tools will affect the final design and the design process; users should be aware of them.

• Combinatorial logic (CL) blocks:
  - Vary in complexity.
  - Vary in I/O.
• Sequential logic blocks (DFFs):
  - Use global clocks.
  - Use global resets.
  - May have mitigation.

User Maps the Design Logic into FPGA Preexisting Logic

[Figure: a design written in a hardware description language (HDL) is mapped onto an equivalent block of preexisting FPGA logic.]

To be presented by Melanie Berg at the Hardened Electronics and Radiation Technology (HEART) 2015 Conference, Chantilly, VA, April 21-24, 2015.


FPGA Configuration (Storage of User Design Mapping)

• Configuration defines the arrangement of pre-existing logic via programmable switches:
  - Functionality (logic cluster) and
  - Connectivity (routes).
• Programmable switch types:
  - Antifuse: one-time programmable (OTP),
  - SRAM: reprogrammable (RP), or
  - Flash: reprogrammable (RP).

[Figure: FPGA fabric showing I/O connects, the routing matrix, and programmable switches.]


Common FPGA Applications

• Controllers,
• Dataflow and interface adaptation,
• Digital signal processing (DSP),
• Software-defined radio,
• ASIC prototyping,
• Medical imaging,
• Robotic control (vision, movement, speech, etc.),
• Cryptology,
• Nuclear plant control,
• The list goes on...

The following short course presentations will provide more details.

Example 2: FPGA Terrestrial Application

[Figure: automotive FPGA applications: digital cluster; telematics; navigation and rear-seat entertainment displays; source MUXing; Personnel Occupancy Detection Systems (PODS) for next-generation airbags; back-up camera; blind-spot warning system; engine control module; back-up sensors; lane departure warning system; emissions control; advanced suspension and traction control; adaptive cruise control; multi-axis power seat control; power steering control; collision avoidance system; injector control (especially diesel engines).]

Source: http://www.eetimes.com/document.asp?doc_id=1305894

Critical applications will want to avoid disaster:

• Safety: can circuits or humans be damaged or hurt?
• Reliability: will the device operate as expected?
• Availability: how often will the system operate as expected?
• Recoverability: if the device malfunctions, can the system come back to a working state?
• Security: can the device and its design be trusted?

Sources of FPGA Failure

[Figure: sources of FPGA failure: packaging and mounting; negative bias temperature instability (NBTI); hot carrier injection (HCI); poor design choices; single event effects (SEEs); dielectric breakdown (DB); environmental stress; total ionizing dose (TID); transistor switching stress; lack of verification.]

How to Protect a System from Failure

• Investigate failure modes to understand risk:
  - Reliability testing (temperature, voltage, mechanical, and logic switching stresses).
  - Radiation testing: single event effects (SEEs) and total ionizing dose (TID).
• Add redundancy:
  - Replication with correction.
  - Replication with detection, which requires recovery:
    - Switch to another device,
    - Try to recover the state,
    - Start over,
    - Alert, or
    - Do nothing... die.
• Add filtration, e.g., finite impulse response (FIR) filters or constant false alarm rate (CFAR) filters.
• Add masking.

Go/No-Go: Single Event Hard Faults and Common Terminology

• Single Event Latch-Up (SEL): the device latches into a high-current state.
  - SEL has been observed in FPGA devices that are currently on the market.
  - Some missions choose to use such devices and design around the SEL.
• Single Event Burnout (SEB): the device draws high current and burns out.
  - Not observed in FPGA devices that are currently on the market.
• Single Event Gate Rupture (SEGR): the gate is destroyed, typically in power MOSFETs.
  - Not observed in FPGA devices that are currently on the market.

Radiation Hardened versus Commercial FPGA Devices

• Radiation hardened FPGA devices are available to users. They make the design cycle much easier!
• They are considered hardened if configuration susceptibility is reduced to an acceptable rate:
  - Generally, less than one node upset per 1x10 days.
  - Be careful: with millions of nodes, this can translate into one or two configuration failures per year.
  - However, if the upset node isn't being used, then your circuit may not be affected by the failure.
• The following presentation will discuss FPGAs with embedded mitigation.
• This presentation focuses on user-inserted mitigation techniques.
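
The warning about millions of nodes is a per-node rate multiplied by a node count. The Python sketch below assumes upsets arrive as a Poisson process; the function names are hypothetical, and the node count and per-node daily rate in the example are illustrative placeholders, not values taken from this presentation.

```python
import math


def config_upsets_per_year(num_nodes: float, upsets_per_node_per_day: float) -> float:
    """Expected configuration upsets per year across all nodes (count x rate x days)."""
    return num_nodes * upsets_per_node_per_day * 365.0


def prob_at_least_one(expected_events: float) -> float:
    """Poisson probability of one or more upsets, given the expected count."""
    return 1.0 - math.exp(-expected_events)


# Hypothetical example: 30 million configuration nodes, 1e-10 upsets/node/day.
expected = config_upsets_per_year(30e6, 1e-10)
print(f"expected upsets/year: {expected:.3f}")  # 1.095
print(f"P(at least one per year): {prob_at_least_one(expected):.3f}")
```

Only upsets in nodes the design actually uses matter, so a refined estimate would count used (essential) nodes rather than all of them.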
Small Device Geometries Enable High-Capacity Applications, but Non-Radiation-Hardened Devices May Require SEU Mitigation

[Figure: bar chart of logic capacity (millions, 0 to 5) for Virtex UltraScale+, Kintex UltraScale+, Virtex UltraScale, Kintex UltraScale, Virtex-7, Virtex-7Q, Stratix 5, Virtex 5, Virtex 5QV, Virtex 4QV and Virtex 4, RT-ProASIC, and RTAX-S; SEU hardened/harder devices are highlighted.]

SEUs and FPGAs

• Ionizing particles cause upsets (SEUs) in FPGAs.
• Each FPGA type has different SEU error signatures:
  - Temporary glitch (transient),
  - Change of state (incorrect state machine transitions),
  - Global upsets: loss of clock or unexpected reset,
  - Configuration corruption, which includes route breakage (no signal can get through); this can be overwhelming.
• The question is how to avoid system failure, and the answer depends on the following:
  - The system's requirements and the definition of failure,
  - The susceptibility of the target FPGA and its surrounding circuitry,
  - Implemented fail-safe strategies,
  - Reliable design practices, and
  - The radiation environment.

Fail-Safe Strategies for Single Event Upsets (SEUs)

• Although there are many sources of FPGA malfunction, this presentation focuses on SEUs as a source of failure.
• The following slides demonstrate commonly used mitigation strategies for FPGA devices.
• What you should learn:
  - The differences between FPGA mitigation strategies.
  - Strengths and weaknesses of various strategies.
  - Questions to ask or considerations to make when evaluating mitigation schemes.
  - Which mitigation schemes are best for various types of FPGA devices.

FPGA Structure Categorization as Defined by NASA Goddard REAG

$P(f_s)_{error} \propto P_{Configuration} + P(f_s)_{functionalLogic} + P_{SEFI}$

• SEU cross-section: σSEU.
• Single event functional interrupts (SEFIs) are out of the scope of this presentation.

[Figure: the design σSEU decomposed into configuration σSEU, functional-logic σSEU (sequential and combinatorial logic (CL) in the data path), and SEFI σSEU (global routes and hidden logic).]

SEU testing is required in order to characterize the σSEU for each of the FPGA categories.
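
To see how the categories combine into a device-level error rate, a common first-order estimate multiplies each measured cross-section by the particle flux of the environment. The Python sketch below assumes a single effective flux and cross-sections already measured at the design's operating frequency fs (the functional-logic term depends on fs); the function name and all numbers are hypothetical, and a real rate prediction integrates the cross-section over the environment's LET spectrum.

```python
def seu_rates(cross_sections_cm2: dict, flux_per_cm2_per_day: float) -> dict:
    """Per-category upset rate (events/day) = cross-section x flux (first-order model)."""
    return {name: sigma * flux_per_cm2_per_day for name, sigma in cross_sections_cm2.items()}


# Hypothetical device-level cross-sections (cm^2) for the three REAG categories.
sigmas = {"configuration": 2e-6, "functional_logic": 5e-7, "sefi": 1e-8}
rates = seu_rates(sigmas, flux_per_cm2_per_day=10.0)

total = sum(rates.values())
dominant = max(rates, key=rates.get)
for name, rate in rates.items():
    print(f"{name:>16}: {rate:.2e} events/day ({rate / total:.1%} of total)")
print("dominant category:", dominant)  # configuration
```

Identifying the dominant term this way is what drives the choice of mitigation: effort spent on a category that contributes little to the total buys little reliability.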
Preliminary Design Considerations for Mitigation and Trade Space

Determine the most susceptible components:

$P(f_s)_{error} \propto P_{Configuration} + P(f_s)_{functionalLogic} + P_{SEFI}$

• Does the designer need to add mitigation?
• Will there be compromises?
  - Performance and speed,
  - Power,
  - Schedule,
  - Mitigating the susceptible components?
  - Reliability (working and mitigating as expected)?

The impact on speed, power, area, reliability, and schedule raises important questions to ask.

[Figure: trade table for optimizing the mitigation choice.]

Single Event Upsets and FPGA Configuration

$P(f_s)_{error} \propto P_{Configuration} + P(f_s)_{functionalLogic} + P_{SEFI}$

Programmable Switch Implementation and SEU Susceptibility

[Figure: an antifuse (OTP) switch implemented as a via to metal, versus an SRAM (RP) switch whose programming bit is held in a memory cell with read/write and data lines.]

Configuration SEU Test Results and the REAG FPGA SEU Model

$P(f_s)_{error} \propto P_{Configuration} + P(f_s)_{functionalLogic} + P_{SEFI}$

Configuration Type | REAG Model
Antifuse | $P(f_s)_{error} \propto P(f_s)_{functionalLogic} + P_{SEFI}$
SRAM (non-mitigated) | $P(f_s)_{error} \propto P_{Configuration}$
Flash | $P(f_s)_{error} \propto P(f_s)_{functionalLogic} + P_{SEFI}$
Hardened SRAM | $P(f_s)_{error} \propto P_{Configuration} + P(f_s)_{functionalLogic} + P_{SEFI}$

What Does the Last Slide Mean?

Configuration Type | Susceptibility
Antifuse | Configuration has been designated as hard regarding SEEs. Susceptibilities only exist in the data paths and global routes. However, global routes are hardened and have a low SEU susceptibility.
SRAM (non-mitigated) | Configuration has been designated as the most susceptible portion of the circuitry. All other upsets (except for global routes) are statistically too insignificant to take into account. For example, it is a waste of time to study data-path transients; however, clock transient studies are significant.
Flash | Configuration has been designated as hard (but NOT immune) regarding SEEs. Susceptibilities also exist in the data paths and global routes (e.g., clocks and resets).
Hardened SRAM | Configuration has been designated as hardened (but NOT hard) regarding SEEs. Susceptibilities also exist in the data paths and global routes (e.g., clocks and resets).

Global routes: clocks and resets.

Example: Routing Configuration Upsets in a Xilinx Virtex FPGA

[Figure: look-up tables (LUTs) connected through a routing matrix.]

Because multiple paths can pass through the routing matrix, a routing configuration upset can be catastrophic, i.e., it can break simple mitigation.

Fixing SRAM-Based Configuration ... Scrubbing Definition

• SEU testing has shown that the configuration memory of unhardened SRAM-based FPGAs is highly susceptible to SEUs.
• We address configuration susceptibility via scrubbing: scrubbing is the act of writing into the FPGA configuration memory while the device's functional logic is operating, with the intent of correcting configuration memory bit errors.

Configuration scrubbing only pertains to SRAM-based configuration devices.
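
A scrubber's core loop can be sketched in a few lines. The Python model below is a hypothetical readback-style scrubber: configuration memory is a list of frames (plain integers here), a golden copy is assumed to be held in protected storage, and any frame that differs is rewritten. A "blind" scrubber would instead rewrite every frame periodically without comparing. Frame layout, sizes, and names are illustrative only, not any vendor's interface.

```python
from typing import List


def scrub_cycle(config: List[int], golden: List[int]) -> List[int]:
    """Readback-scrubbing sketch: compare every configuration frame with the
    golden image and rewrite frames that differ. Returns corrected frame indices."""
    if len(config) != len(golden):
        raise ValueError("configuration and golden image must have the same number of frames")
    corrected = []
    for index, (frame, reference) in enumerate(zip(config, golden)):
        if frame != reference:
            config[index] = reference  # rewrite in place; the user design keeps running
            corrected.append(index)
    return corrected


golden_image = [0x0F0F, 0x00FF, 0xAAAA, 0x1234]  # hypothetical frame contents
config_memory = list(golden_image)
config_memory[2] ^= 1 << 5                       # single-bit upset in frame 2
print(scrub_cycle(config_memory, golden_image))  # [2]
print(config_memory == golden_image)             # True
```

This repairs only configuration bits; it does nothing for state that the upset may already have corrupted in the functional logic.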
Warning!

• Fixing a configuration bit does not mean that you have fixed the state in the functional logic path.
• To guarantee that the functional logic is in the expected state after the configuration bit is fixed, either the state must be restored or a reset must be issued.

Reliably getting to an expected state after a configuration-bit SEU (that affects the design's functionality) requires one of the following:
  - Fix the configuration bit, then reset or correct the DFFs, or
  - Fully reconfigure the device.

Single Event Upsets in an FPGA's Functional Data Path and Fail-Safe Strategies

Data-Path SEUs and Their Effect at the System Level

• A system implemented in an FPGA is a cascade of sequential and combinatorial logic.
• The probability of a system error due to an SEU depends on many factors:
  - Probability of fault generation in a gate (SET or SEU).
  - Probability of error propagation: will the SET or SEU force the system's next state to be incorrect?

Probability of Error Propagation in a Data Path

Upsets usually occur between clock cycles. They can cause a system-level malfunction if the SET or SEU forces the system's next state to be incorrect.

• Capacitive filtration: data-path capacitance can stop transient upset propagation, e.g.:
  - Routing metal or heavy loading.
  - If a transient doesn't reach a sequential element, then it most likely will not cause a system upset.
• Logic masking: redundancy and mitigation of paths can stop upset propagation.
• Logic masking: paths turned off by gated logic can stop upset propagation.
• Temporal delay: path delays can block temporary SEUs from disturbing the next-state calculation.
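
Temporal masking is often approximated, to first order, by the chance that a transient overlaps a DFF's latching window. The Python sketch below uses that common approximation, P ≈ (SET width + setup/hold window) × fs, capped at 1; it is a textbook-style estimate, not the REAG model itself, and the function name, widths, and frequencies are hypothetical. It does show why the data-path term grows with the operating frequency fs.

```python
def p_set_latched(set_width_s: float, setup_hold_s: float, freq_hz: float) -> float:
    """First-order probability that an SET arriving at a random time is captured
    by a DFF: latching-window width divided by clock period, capped at 1."""
    if min(set_width_s, setup_hold_s, freq_hz) < 0:
        raise ValueError("inputs must be non-negative")
    return min(1.0, (set_width_s + setup_hold_s) * freq_hz)


# Hypothetical 200 ps transient and 100 ps setup+hold window at several clock rates.
for f in (10e6, 100e6, 1e9):
    print(f"{f / 1e6:>6.0f} MHz: P(latched) ~ {p_set_latched(200e-12, 100e-12, f):.3f}")
```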
Fail-Safe Strategies for FPGA Critical Applications

Differentiating Fail-Safe Strategies

• Detection:
  - Watchdog (state or logic monitoring).
  - Simplistic checking ... complex decoding.
  - Action (correction or recovery).
• Masking (does not mean correction):
  - Not letting an error propagate to other logic.
  - Redundancy plus mitigation or detection.
  - Turning off the faulty path.
• Correction (the error may not be masked):
  - The error state (memory) is changed/fixed.
  - Needs feedback or a new data flush cycle.
• Recovery:
  - Bring the system to a deterministic state.
  - Might include correction.


Redundancy Is Not Enough

• Just adding redundancy to a system is not enough to assume that the system is well protected.
• Questions/concerns that must be addressed for a critical system that expects redundancy to cure all (or most) problems:
  - How is the redundancy implemented?
  - What portions of your system are protected? Does the protection comply with the results from radiation testing?
  - Is detection of a malfunction required to switch to a redundant system or to recover?
  - If detection is necessary, how quickly can the detection be performed and responded to?
  - Is detection enough? Does the system require correction?

These crucial concerns should be addressed at design reviews and prior to design implementation.

Mitigation

• Error masking vs. error correction: there's a difference.
• Mitigation can be:
  - User inserted: part of the actual design process.
  - Embedded: built into the device library cells. The user does not verify the mitigation; the manufacturer does.
• Mitigation should reduce error:
  - Generally through redundancy.
  - Incorrect implementation can increase error.
  - Overly complex mitigation cannot be verified and incurs too high a risk to implement.

Availability versus Correct Operation

• Requirements must be satisfied.
• What is your expected up-time versus down-time (availability)?
• Is correct operation well defined? Unambiguously?
• Is system failure well defined? Unambiguously?
• Can availability and correct operation be deterministic regardless of error signature?
• Availability:
  - Flushable designs: systems that can be reset or are self-correcting. Availability is affected during reset or correction time (down-time); however, the down-time is tolerable as defined by system requirements.
  - Non-flushable designs: system requirements are strict and require minimal down-time. The use of resets must be kept to a minimum.

Detection and Recovery

• Not all mitigation schemes require detection.
• Questions/considerations:
  - If your scheme requires detection:
    - Can the system detect all error signatures?
    - Can the system detect all error signatures fast enough?
    - Do different errors require different recovery schemes, and can the system accommodate them?
  - How are you going to verify the detection and recovery?
  - How much downtime will there be during recovery? (Downtime affecting availability = error detection time + recovery time − masked error time.)
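
The recovery-time expression on this slide (error detection time plus recovery time, minus the time the error is masked) can be turned into an availability estimate. The Python sketch below is a hypothetical model: it assumes errors arrive at a constant average rate, that each error costs the downtime given by that expression (floored at zero), and that availability is the fraction of each day the system is up. Function names, rates, and times are illustrative.

```python
SECONDS_PER_DAY = 86_400.0


def downtime_per_error(t_detect_s: float, t_recover_s: float, t_masked_s: float = 0.0) -> float:
    """Downtime for one error: detection + recovery time, minus any time the error is masked."""
    return max(0.0, t_detect_s + t_recover_s - t_masked_s)


def availability(errors_per_day: float, downtime_s: float) -> float:
    """Fraction of each day the system operates as expected (clamped to [0, 1])."""
    return max(0.0, 1.0 - errors_per_day * downtime_s / SECONDS_PER_DAY)


# Hypothetical: 2 errors/day, 1 s to detect, 9 s to recover, nothing masked.
d = downtime_per_error(1.0, 9.0)
print(f"downtime per error: {d:.1f} s, availability: {availability(2.0, d):.6f}")
```

Comparing the result against the required up-time is one way to decide between a flushable design that tolerates resets and a non-flushable one that needs correction without downtime.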
Dual Redundant Systems (Detection Systems)

Dual Redundancy Example

[Figure: dual redundant system flow: compare, alert, synchronize, recover.]
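
The compare, alert, synchronize, and recover steps can be illustrated with a hypothetical dual-redundant counter in Python. Two copies run in lockstep and are compared after every clock; on a mismatch the model raises an alert and recovers by resetting both copies to a known state. A plain comparison only detects the error: with two copies there is no majority to say which one is correct, which is why a recovery action (here, a reset) is required. The class name, width, and reset policy are illustrative assumptions.

```python
class DualRedundantCounter:
    """Two lockstep copies of a 4-bit counter with compare/alert/recover."""

    MASK = 0xF

    def __init__(self) -> None:
        self.a = 0
        self.b = 0
        self.alerts = 0

    def clock(self) -> None:
        # Both copies compute the same next state.
        self.a = (self.a + 1) & self.MASK
        self.b = (self.b + 1) & self.MASK
        # Compare: any disagreement means one copy was upset, but not which one.
        if self.a != self.b:
            self.alerts += 1      # alert
            self.a = self.b = 0   # synchronize/recover: reset both to a known state


counter = DualRedundantCounter()
counter.clock()
counter.clock()
counter.a ^= 0x1                  # inject an SEU into copy "a" between clocks
counter.clock()
print(counter.alerts, counter.a, counter.b)  # 1 0 0
```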
Mitigation: Fail-Safe Strategies That Do Not Require Fault Detection but Provide SEU Masking and/or Correction

Triple Modular Redundancy (TMR)

TMR Schemes Use Majority Voting

$Voter = (I_1 \wedge I_2) \vee (I_0 \wedge I_2) \vee (I_0 \wedge I_1)$

[Figure: majority voter (2 out of 3); triplicate and vote.]
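
The 2-of-3 voter, (I1 AND I2) OR (I0 AND I2) OR (I0 AND I1), is small enough to check exhaustively. The Python sketch below applies it bitwise, as hardware applies one voter gate per bit, and prints the full single-bit truth table; the function name is illustrative.

```python
def majority(i0: int, i1: int, i2: int) -> int:
    """Bitwise 2-of-3 majority voter: (I1 & I2) | (I0 & I2) | (I0 & I1)."""
    return (i1 & i2) | (i0 & i2) | (i0 & i1)


# Exhaustive truth table for single-bit inputs.
for i0 in (0, 1):
    for i1 in (0, 1):
        for i2 in (0, 1):
            print(i0, i1, i2, "->", majority(i0, i1, i2))

# Works bitwise on multi-bit words: one copy upset in bit 3 is out-voted.
print(hex(majority(0xA5, 0xA5 ^ 0x08, 0xA5)))  # 0xa5
```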
Triplicate and Vote

Singular Data Path

TMR Implementation

• As previously illustrated, TMR can be implemented in a variety of ways.
• The definition of TMR depends on what portion of the circuit is triplicated and where the voters are placed.
• The strongest TMR implementation will triplicate all data paths and contain separate voters for each data path:
  - However, this can be costly in area, power, and complexity.
  - Hence, a trade is performed to determine the TMR scheme that requires the least amount of effort and circuitry while still meeting project requirements.

Block Triple Modular Redundancy: BTMR

[Figure: three copies of a block feeding a voter. BTMR can only mask errors; triplication with no correction/flushing gives 3x the error rate of a single copy.]

• Needs feedback to correct.
• Cannot apply internal correction from voted outputs.
• If blocks are not regularly flushed (e.g., reset), errors can accumulate, and BTMR may not be an effective technique.
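
Error accumulation is easy to reproduce in a toy model. In the hypothetical Python sketch below, three copies of an 8-bit register hold their values with no feedback from the voter. A single upset is masked, but a second upset in another copy at the same bit position defeats the vote until the block is flushed. (For a real block, one upset often corrupts much of its internal state, which makes such overlap far more likely than in this single-register model.) Class and method names are illustrative.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote."""
    return (a & b) | (a & c) | (b & c)


class BTMRRegister:
    """Three independent copies; the voter only drives the output (no feedback)."""

    def __init__(self, value: int) -> None:
        self.copies = [value] * 3

    def upset(self, copy: int, bit: int) -> None:
        self.copies[copy] ^= 1 << bit      # model an SEU in one copy

    def voted(self) -> int:
        return majority(*self.copies)

    def flush(self, value: int) -> None:
        self.copies = [value] * 3          # e.g., a periodic reset reloads all copies


reg = BTMRRegister(0x3C)
reg.upset(0, bit=0)
print(hex(reg.voted()))   # 0x3c: the first upset is masked, but it stays in copy 0
reg.upset(1, bit=0)
print(hex(reg.voted()))   # 0x3d: accumulated upsets out-vote the good copy
reg.flush(0x3C)
print(hex(reg.voted()))   # 0x3c: flushing clears the accumulated errors
```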
Examples of Flushable BTMR Designs

• Shift registers.
• Transmission channels: it is typical for transmission channels to send and reset after every sent packet.
• Lock-step microprocessors with relaxed requirements, such that the microprocessors can be reset (or power-cycled) every so often.

Transmission channel example:
[Figure: transmission channel example with RESET.]

If the System Is Not Flushable, Then BTMR May Not Provide the Expected Level of Mitigation

• BTMR can work well as a mitigation scheme if the expected MTTF >> the expected window of correct operation.
• Clarification: if the expected time to failure for one block is less than the required full-availability window, then BTMR doesn't buy you anything.
• BTMR can actually be a detriment: complexity, power, area, and a false sense of performance.

Combine SEU Data and Classical Reliability Models for Mitigation Analysis

Reliability for 1 block: $R_{block} = e^{-\lambda t}$
Reliability for BTMR: $R_{BTMR} = 3e^{-2\lambda t} - 2e^{-3\lambda t}$
Mean time to failure for 1 block: $MTTF_{block} = 1/\lambda$
Mean time to failure for BTMR: $MTTF_{BTMR} = \frac{5}{6\lambda} \approx 0.833/\lambda$

where λ is the failure rate (failures/time) obtained from SEU data.

[Figure: "Reliability: Simplex System versus Block TMR Version." Reliability versus time (0 to 2000 days) for System 1 (λ = 1/40 failure/day), System 2 (λ = 1/730 failure/day), and the BTMR versions of each. Annotations: MTTF_BTMR < MTTF_block; operating in this time interval will provide a slight increase in reliability; however, it will provide a relatively hard design.]
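
The reliability curves and the MTTF comparison follow from the standard expressions R_block = e^(−λt) and R_BTMR = 3R_block² − 2R_block³ (perfect voter, no flushing). Setting R_BTMR = R_block gives 2R² − 3R + 1 = 0, so the curves cross where R_block = 1/2, i.e., at t = ln 2/λ: BTMR is more reliable only before that point. The Python sketch below evaluates this for the two failure rates on the slide; the function names are illustrative.

```python
import math


def r_block(lam: float, t: float) -> float:
    """Reliability of one block with constant failure rate lam (exponential model)."""
    return math.exp(-lam * t)


def r_btmr(lam: float, t: float) -> float:
    """2-of-3 reliability of block TMR with a perfect voter and no repair/flushing."""
    r = r_block(lam, t)
    return 3 * r**2 - 2 * r**3


def mttf_btmr(lam: float) -> float:
    """Integral of r_btmr over [0, inf): 3/(2*lam) - 2/(3*lam) = 5/(6*lam)."""
    return 5.0 / (6.0 * lam)


for name, lam in (("System 1", 1 / 40), ("System 2", 1 / 730)):
    crossover = math.log(2) / lam   # R_BTMR > R_block only for t < crossover
    print(f"{name}: MTTF block = {1 / lam:.0f} d, MTTF BTMR = {mttf_btmr(lam):.1f} d, "
          f"BTMR helps only for t < {crossover:.1f} d")
```

This is the quantitative form of the earlier rule of thumb: BTMR pays off only when the required operating window is well below the single-block MTTF.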


What Should Be Done If Availability Needs to Be Increased?

• If the blocks within the BTMR have a relatively high upset rate with respect to the availability window, then stronger mitigation must be implemented.
• Bring the voting/correction inside the modules, i.e., bring the voting to the module DFFs.

DFF: Edge-triggered flip-flop; CL: Combinatorial logic

The following slides illustrate the various forms of TMR that include voter insertion in the data path.

TMR Nomenclature | Description | TMR Acronym
Local TMR | DFFs are triplicated | LTMR
Distributed TMR | DFFs and CL data paths are triplicated | DTMR
Global TMR | DFFs, CL data paths, and global routes are triplicated | GTMR or XTMR


Describing Mitigation Effectiveness Using a Model

DFF: Edge-triggered flip-flop; CL: Combinatorial logic

$P(f_s)_{error} \propto P_{Configuration} + P(f_s)_{functionalLogic} + P_{SEFI}$

$P(f_s)_{functionalLogic} \propto P(f_s)_{DFFSEU \rightarrow SEU} + P(f_s)_{SET \rightarrow SEU}$

• $P(f_s)_{DFFSEU \rightarrow SEU}$: probability that an SEU in a DFF will manifest as an error in the next system clock cycle.
• $P(f_s)_{SET \rightarrow SEU}$: probability that an SET in the CL will manifest as an error in the next system clock cycle.

Local Triple Modular Redundancy (LTMR)

• Only the DFFs are triplicated and mitigated.
• LTMR masks upsets from DFFs, and corrects DFF upsets if feedback is used.

$P(f_s)_{error} \propto P_{Configuration} + P(f_s)_{functionalLogic} + P_{SEFI}$

$P(f_s)_{functionalLogic} \propto P(f_s)_{DFFSEU \rightarrow SEU} + P(f_s)_{SET \rightarrow SEU}$
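
The difference that voter feedback makes can be shown with a toy Python model of one LTMR register. The next-state logic (a hypothetical 8-bit counter) exists as a single copy, as in LTMR where only the DFFs are triplicated; with feedback, all three DFFs load the next state computed from the voted value, so an upset copy is overwritten on the next clock. For contrast, the no-feedback path lets each copy keep evolving from its own state, so the error persists. Function names and widths are illustrative.

```python
from typing import Callable, List


def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote."""
    return (a & b) | (a & c) | (b & c)


def step(copies: List[int], next_state: Callable[[int], int], feedback: bool = True) -> List[int]:
    """Clock a triplicated register once and return the new contents of the three DFFs."""
    if feedback:
        # Voter output drives the single next-state logic, which feeds all three DFFs.
        return [next_state(majority(*copies))] * 3
    # No feedback: each DFF copy evolves from its own (possibly upset) value.
    return [next_state(c) for c in copies]


def counter_next(state: int) -> int:
    """Example next-state logic: an 8-bit counter."""
    return (state + 1) & 0xFF


dffs = [5, 5, 5]
dffs[1] ^= 0x10                                   # SEU in DFF copy 1 (5 -> 21)
print(majority(*dffs))                            # 5: the upset is masked
print(step(dffs, counter_next))                   # [6, 6, 6]: corrected on the next clock
print(step(dffs, counter_next, feedback=False))   # [6, 22, 6]: the error persists
```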
Distributed Triple Modular Redundancy (DTMR): DFFs + Data Paths

All DFFs with feedback have voters.

$P(f_s)_{error} \propto P_{Configuration} + P(f_s)_{functionalLogic} + P_{SEFI}$

$P(f_s)_{functionalLogic} \propto P(f_s)_{DFFSEU \rightarrow SEU} + P(f_s)_{SET \rightarrow SEU}$

Global Triple Modular Redundancy (GTMR): DFFs + Data Paths + Global Routes

All DFFs with feedback have voters.

$P(f_s)_{error} \propto P_{Configuration} + P(f_s)_{functionalLogic} + P_{SEFI}$

$P(f_s)_{functionalLogic} \propto P(f_s)_{DFFSEU \rightarrow SEU} + P(f_s)_{SET \rightarrow SEU}$

Theoretically, GTMR Is the Strongest Mitigation Strategy... BUT...

• Triplicating a design and its global routes takes up a lot of power and area.
• It is generally performed after synthesis by a tool, not as part of the RTL.
• Skew between clock domains must be minimized such that it is less than the feedback delay from a voter to its associated DFF:
  - Does the FPGA contain enough low-skew clock trees? (Each clock plus its synchronized reset) x3.
  - Limit the skew of clocks coming into the FPGA.
  - Limit the skew of clocks from their input pin to their clock tree.
• It is difficult to verify.

Currently, What Are the Biggest Challenges Regarding Mitigation Insertion?

• Tool availability.
• Users are not selecting the correct mitigation scheme for their target FPGA.

[Table: mitigation recommendations by device type: commercial antifuse, antifuse + LTMR, commercial SRAM, commercial flash, and hardened SRAM. Legend: general recommendation; not recommended but may be a solution for some situations; will not be a good solution.]

User versus Embedded Mitigation

A subset of user-inserted mitigation strategies has been presented.

None of the strategies is 100% fail-safe.

Depending on the project requirements and the target device's SEU susceptibility, the most efficient mitigation strategy should be selected.

The following short courses will provide information regarding FPGA devices that contain embedded mitigation.

In most cases, devices with embedded mitigation do not require additional (user-inserted) mitigation.

Concerns and Challenges for Mitigation Insertion

• User insertion of mitigation strategies in most FPGA devices has proven to be a challenging task because of reliability, performance, area, and power constraints:
  - It is difficult to synchronize across triplicated systems.
  - Mitigation insertion slows down the system.
  - A triplicated version of a design can't always fit into one device.
  - Power and thermal hot spots are increased.
• Newer devices have a significantly higher gate count and lower power. This helps accommodate area and power constraints while triplicating a design; however, it increases the challenge of module synchronization.
• Embedded mitigation has helped in the design process. However, it is proving to be an ever-increasing challenge for manufacturers.

Summary

• FPGA devices have become a lucrative alternative to ASICs.
• For critical applications, mitigation may be required.
• Determine the correct mitigation scheme for your mission while incorporating the given requirements:
  - Understand the susceptibility of the target FPGA and how it responds to other devices.
  - Investigate whether the selected mitigation strategy is compatible with the target FPGA.
  - Calculate the reliability of the mitigation strategy to determine whether the final system will satisfy requirements.
• Although it is desirable from a user's perspective to have embedded mitigation, cost seems to be driving the market toward unmitigated commercial FPGA devices. Hence, it will be necessary for users to familiarize themselves with optimal mitigation insertion and usage.